{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "网络爬虫一个非常炫酷，同时又非常实用的技术，这节课我们给大家介绍如何写程序去完成网络数据的抓取。 requests是一个很实用的Python HTTP客户端库，编写爬虫和测试服务器响应数据时经常会用到。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.安装"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "要安装requests，最方便快捷的方法是使用pip进行安装。在命令行执行 pip install requests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.请求"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最简单的静态网页数据获取，需要先对网页进行请求，比如我们以腾讯新闻为例："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "result = requests.get(\"http://www.techweb.com.cn//\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.1 状态码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "200"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.status_code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "200表示正常返回结果，其余的数字(比如404，500等)表示请求未拿到正常返回结果。咱们来试几个网站。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "200\n",
      "404\n",
      "302\n"
     ]
    }
   ],
   "source": [
    "def get_status(url):\n",
    "    r = requests.get(url, allow_redirects = False)\n",
    "    return r.status_code\n",
    "\n",
    "print get_status('http://www.zhidaow.com')\n",
    "#200\n",
    "print get_status('http://www.zhidaow.com/hi404/')\n",
    "#404\n",
    "print get_status('http://www.baidu.com/link?url=QeTRFOS7TuUQRppa0wlTJJr6FfIYI1DJprJukx4Qy0XnsDO_s9baoO8u1wvjxgqN')\n",
    "#302"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.2 文件编码"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们的最终目的是获取显示在网页上的内容，文本字符串有不同的编码格式，咱们可以先从返回结果里做一个简单的了解。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ISO-8859-1'"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.3 返回的网页内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<!doctype html>\n",
      "<html lang=\"zh-CN\">\n",
      "<head>\n",
      "<meta charset=\"UTF-8\"/>\n",
      "<meta name=\"mobile-agent\" content=\"format=html5; url=http://m.techweb.com.cn/\"/>\n",
      "<meta name=\"mobile-agent\" content=\"format=xhtml; url=http://m.techweb.com.cn/\"/>\n",
      "<meta name=\"baidu_union_verify\" content=\"e718eb0dd4320c884de414f6558fc60f\">\n",
      "<title>TechWeb.com.cn - 领先的互联网消费互动媒体</title>\n",
      "<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\"/>\n",
      "<meta name=\"renderer\" content=\"webkit\"/> \n",
      "<meta name=\"keywords\" content=\"互联网消费,互联网金融,互联网+,移动互联网,电子商务,手机游戏,O2O,创业创新,投资融资,智能设备,智能手机,VR及AR,AI人工智能\" />\n",
      "<meta name=\"description\" content=\"TechWeb专注于互联网消费领域，每日专业提供互联网产品、智能设备及互联网服务等方面的最新资讯，呈现为网站、微博、微信、APP等全媒体新形态，是国内领先的互联网消费互动媒体。\" />\n",
      "<link href=\"http://s2.techweb.com.cn/s\n"
     ]
    }
   ],
   "source": [
    "content = result.content\n",
    "print content[:1000] #前1000个字符"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以上为网页的HTML格式数据，接着我们就可以从上面取出所需要的内容了，比如现在我们需要爬取的是这一页腾讯新闻的新闻标题。 我们要用到另外一个非常好用的库，叫做BeautifulSoup，我们用它完成网页返回结果的解析和内容的提取。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<title>The Dormouse's story</title>\n",
      "title\n",
      "The Dormouse's story\n",
      "head\n",
      "<p class=\"title\"><b>The Dormouse's story</b></p>\n",
      "['title']\n",
      "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>\n",
      "<a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
      "http://example.com/elsie\n",
      "http://example.com/lacie\n",
      "http://example.com/tillie\n"
     ]
    }
   ],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import requests\n",
    "\n",
    "result = requests.get(\"https://www.chiphell.com/\")\n",
    "content = result.content\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "html = '''\n",
    "<html><head><title>The Dormouse's story</title></head>\n",
    "<body>\n",
    "<p class=\"title\"><b>The Dormouse's story</b></p>\n",
    "\n",
    "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n",
    "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">Elsie</a>,\n",
    "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n",
    "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n",
    "and they lived at the bottom of a well.</p>\n",
    "<p class=\"story\">...</p>\n",
    "'''\n",
    "\n",
    "# soup = BeautifulSoup(content,fromEncoding=\"GB2312\") #注意这个地方\n",
    "soup = BeautifulSoup(html,'lxml')\n",
    "# samples = soup.find_all(\"a\", \"linkto\")\n",
    "# samples[0]\n",
    "# print(soup.prettify())\n",
    "print(soup.title)#<title>The Dormouse's story</title>\n",
    "print(soup.title.name) # title\n",
    "print(soup.title.string) # The Dormouse's story\n",
    "print(soup.title.parent.name) # head\n",
    "\n",
    "print(soup.p) # <p class=\"title\"><b>The Dormouse's story</b></p>\n",
    "print(soup.p['class']) #['title']\n",
    "print(soup.a) #<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>\n",
    "# print(soup.find_all('a')) # [<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>, <a class=\"sister\" href=\"http://example.com/lacie\" id=\"link2\">Lacie</a>, <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>]\n",
    "\n",
    "# import pandas as pd\n",
    "# df1 = pd.DataFrame(soup.find_all('a'))\n",
    "# df1\n",
    "print(soup.find(id='link3')) # <a class=\"sister\" href=\"http://example.com/tillie\" id=\"link3\">Tillie</a>\n",
    "\n",
    "\n",
    "for link in soup.find_all('a'):\n",
    "    print(link.get('href'))\n",
    "\n",
    "# print(soup.get_text())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Dormouse's story\n"
     ]
    }
   ],
   "source": [
    "print soup.title.string.encode('utf8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<a class=\"sister\" href=\"http://example.com/elsie\" id=\"link1\">Elsie</a>\n"
     ]
    }
   ],
   "source": [
    "print(soup.a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "'NoneType' object is not callable",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-78-55897d3a9ba8>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      4\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[1;32min\u001b[0m \u001b[0msoup\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfind_all\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'a'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      5\u001b[0m     \u001b[0mel_data\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m{\u001b[0m\u001b[1;33m}\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 6\u001b[1;33m     \u001b[1;32mfor\u001b[0m \u001b[0mchild\u001b[0m \u001b[1;32min\u001b[0m \u001b[0ms\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mgetattrs\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m      7\u001b[0m         \u001b[0mel_data\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mchild\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mname\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mchild\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mpyval\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      8\u001b[0m     \u001b[0mdata\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mel_data\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mTypeError\u001b[0m: 'NoneType' object is not callable"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data = []\n",
    "for s in soup.find_all('a'):\n",
    "    el_data = {}\n",
    "    for child in s.getattrs():\n",
    "        el_data[child.name] = child.pyval\n",
    "    data.append(el_data)\n",
    "pd.DataFrame(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.4 在URLs中传递参数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有时候我们需要在URL中传递参数，比如在采集百度搜索结果时，我们wd参数（搜索词）和rn参数（搜素结果数量），你可以手工组成URL，requests也提供了一种看起来很NB的方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://www.baidu.com/s?rn=10&wd=%E4%BA%94%E5%85%AB%E5%90%8C%E5%9F%8E\n"
     ]
    }
   ],
   "source": [
    "payload = {'wd':'五八同城', 'rn': '10'}\n",
    "r = requests.get(\"http://www.baidu.com/s\", params=payload)\n",
    "print r.url"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.5 设置超时时间"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以通过timeout属性设置超时时间，一旦超过这个时间还没获得响应内容，就会提示错误。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Response [200]>"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "requests.get('http://github.com', timeout=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.6 添加代理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "爬虫爬取数据时为避免被封IP，经常会使用代理。requests也有相应的proxies属性。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "ename": "ProxyError",
     "evalue": "HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.zhidaow.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000000000A066550>: Failed to establish a new connection: [Errno 10060] ',)))",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mProxyError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-81-fa4683bf28ab>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      5\u001b[0m   \u001b[1;34m\"https\"\u001b[0m\u001b[1;33m:\u001b[0m \u001b[1;34m\"http://10.10.1.10:1080\"\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      6\u001b[0m }\n\u001b[1;32m----> 7\u001b[1;33m \u001b[0mrequests\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"http://www.zhidaow.com\"\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mproxies\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mproxies\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32mF:\\Program\\Anaconda2\\lib\\site-packages\\requests\\api.pyc\u001b[0m in \u001b[0;36mget\u001b[1;34m(url, params, **kwargs)\u001b[0m\n\u001b[0;32m     70\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     71\u001b[0m     \u001b[0mkwargs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msetdefault\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'allow_redirects'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mTrue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 72\u001b[1;33m     \u001b[1;32mreturn\u001b[0m \u001b[0mrequest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'get'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mparams\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     73\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     74\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mF:\\Program\\Anaconda2\\lib\\site-packages\\requests\\api.pyc\u001b[0m in \u001b[0;36mrequest\u001b[1;34m(method, url, **kwargs)\u001b[0m\n\u001b[0;32m     56\u001b[0m     \u001b[1;31m# cases, and look like a memory leak in others.\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     57\u001b[0m     \u001b[1;32mwith\u001b[0m \u001b[0msessions\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSession\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0msession\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 58\u001b[1;33m         \u001b[1;32mreturn\u001b[0m \u001b[0msession\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0murl\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0murl\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     59\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     60\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mF:\\Program\\Anaconda2\\lib\\site-packages\\requests\\sessions.pyc\u001b[0m in \u001b[0;36mrequest\u001b[1;34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001b[0m\n\u001b[0;32m    506\u001b[0m         }\n\u001b[0;32m    507\u001b[0m         \u001b[0msend_kwargs\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msettings\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 508\u001b[1;33m         \u001b[0mresp\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mprep\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0msend_kwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    509\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    510\u001b[0m         \u001b[1;32mreturn\u001b[0m \u001b[0mresp\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mF:\\Program\\Anaconda2\\lib\\site-packages\\requests\\sessions.pyc\u001b[0m in \u001b[0;36msend\u001b[1;34m(self, request, **kwargs)\u001b[0m\n\u001b[0;32m    616\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    617\u001b[0m         \u001b[1;31m# Send the request\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 618\u001b[1;33m         \u001b[0mr\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0madapter\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    619\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    620\u001b[0m         \u001b[1;31m# Total elapsed time of the request (approximately)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mF:\\Program\\Anaconda2\\lib\\site-packages\\requests\\adapters.pyc\u001b[0m in \u001b[0;36msend\u001b[1;34m(self, request, stream, timeout, verify, cert, proxies)\u001b[0m\n\u001b[0;32m    500\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    501\u001b[0m             \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0me\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreason\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0m_ProxyError\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 502\u001b[1;33m                 \u001b[1;32mraise\u001b[0m \u001b[0mProxyError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0me\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mrequest\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mrequest\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    503\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    504\u001b[0m             \u001b[1;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0me\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mreason\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0m_SSLError\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mProxyError\u001b[0m: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.zhidaow.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000000000A066550>: Failed to establish a new connection: [Errno 10060] ',)))"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "\n",
    "proxies = {\n",
    "    \"http\": \"http://10.10.1.10:3128\",\n",
    "  \"https\": \"http://10.10.1.10:1080\",\n",
    "}\n",
    "requests.get(\"http://www.zhidaow.com\",proxies=proxies)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果代理需要账户和密码，则需这样"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
    "proxies = {\n",
    "    \"http\": \"http://user:pass@10.10.1.10:3128/\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.7 请求头内容"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "请求头内容可以用r.request.headers来获取。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.18.4'}"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r.request.headers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.8 模拟登陆"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有很多网站需要我们登录之后才能完成数据的抓取，requests同样支持提交表单(包含用户名和密码)进行登录。代码的示例如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = requests.session()\n",
    "data = {'user':'用户名','passdw':'密码'}\n",
    "#post 换成登录的地址，\n",
    "res=s.post('http://www.xxx.net/index.php?action=login',data);\n",
    "#换成抓取的地址\n",
    "s.get('http://www.xxx.net/archives/155/');"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
